Abstract

Logistic regression is usually applied to investigate the association between inherited genetic variants and a binary 
disease phenotype. A limitation of standard methods used to estimate the parameters of logistic regression models 
is their strong dependence on a few observations deviating from the majority of the data. 
We used data from the Genetic Analysis Workshop 18 to explore the possible benefit of robust logistic regression 
to estimate the genetic risk of hypertension. The comparison between standard and robust methods relied on the 
influence of departing hypertension profiles (outliers) on the estimated odds ratios, areas under the receiver 
operating characteristic curves, and clinical net benefit. 

Our results confirmed that single outliers may substantially affect the estimated genotype relative risks. The ranking 
of variants by probability values was different in standard and in robust logistic regression. For cutoff probabilities 
between 0.2 and 0.6, the clinical net benefit estimated by leave-one-out cross-validation in the investigated sample 
was slightly larger under robust regression, but the overall area under the receiver operating characteristic curve 
was larger for standard logistic regression. The potential advantage of robust statistics in the context of genetic 
association studies should be investigated in future analyses based on real and simulated data. 



Background 

Hypertension is a common chronic medical condition 
characterized by elevated arterial blood pressure. High 
blood pressure is associated with an increased risk of 
stroke, heart attack, and other serious diseases. Age, gen- 
der, tobacco smoking, alcohol consumption, and high 
body mass index constitute established risk factors for 
hypertension [1]. A genetic component has also been pos- 
tulated. It has been shown that individuals with a family 
history of hypertension have on average a higher blood 
pressure than individuals without a family history. Yanek
et al. found a 44% higher prevalence of hypertension in
siblings of affected persons than in the general reference
population [2]. In a Canadian study, standardized risk
ratios of hypertension were higher for first-degree relatives
than for spouses of probands with hypertension [3]. In
genetic studies, a large number of polymorphisms have
been associated with hypertension and validated in
independent cohorts; 14 loci had been identified as
of 2010, and many genetic studies are currently in
progress [4-8].

The relationship between inherited genetic polymorphisms
and a binary response variable (with/without hypertension)
can be investigated using logistic regression
models that simultaneously consider the effects of multiple
risk factors. Standard methods used to estimate the
parameters of logistic regression models (for example,
iteratively reweighted least squares) are limited by their
dependence on a few observations departing from the
majority of the data. This contrasts with the purpose of
genetic risk models, which aim to predict a health
outcome that holds for the bulk of individuals and to
identify persons with a deviating, high risk of disease. We
use data from the Genetic Analysis Workshop 18 (GAW18)
to explore the possible benefit of robust parameter
estimates in logistic regression models for the genetic
prediction of hypertension risk.
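
The dependence of standard estimates on single observations can be illustrated with a minimal Python sketch of iteratively reweighted least squares for a model with an intercept and one covariate. The ages and hypertension statuses are hypothetical toy values, not GAW18 data, and the function names are illustrative assumptions; the analyses reported in this article were carried out in R.

```python
import math


def fit_logistic_irls(x, y, n_iter=25):
    """Fit logit P(y = 1) = b0 + b1 * x by iteratively reweighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Current fitted probabilities and working weights p * (1 - p).
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Score vector X'(y - p) and information matrix X'WX (2 x 2 case).
        s0 = sum(yi - pi for yi, pi in zip(y, p))
        s1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2 x 2 linear system for the update.
        b0 += (i11 * s0 - i01 * s1) / det
        b1 += (i00 * s1 - i01 * s0) / det
    return b0, b1


def deviance(x, y, b0, b1):
    """Minus twice the Bernoulli log-likelihood."""
    loglik = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        loglik += math.log(p) if yi == 1 else math.log(1.0 - p)
    return -2.0 * loglik


# Hypothetical toy data (not GAW18): age in years, status (1 = hypertensive).
age = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 85]
htn = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0]

b0_all, b1_all = fit_logistic_irls(age, htn)
# Refit without the last individual, an elderly normotensive person.
b0_red, b1_red = fit_logistic_irls(age[:-1], htn[:-1])

print("OR per year, all individuals:", math.exp(b1_all))
print("OR per year, one excluded:   ", math.exp(b1_red))
```

In this toy example, removing a single elderly normotensive individual increases the estimated age effect, which is the kind of sensitivity that motivates robust estimation.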

Methods 

The analysed data (real phenotypes) were derived from 
142 unrelated individuals who participated in the San 
Antonio Family Heart or Family Diabetes/Gallbladder 
studies. Longitudinal information on hypertension, age, 
gender, and current tobacco smoking was measured up 
to 4 times per individual; the present analyses relied on 
the first available measurement. Further information is 
provided in several articles [9-12]. 

The original data were filtered according to the following
criteria: (a) individuals needed at least 1 measurement with
complete information on hypertension and age, (b) monomorphic
variants were excluded and each polymorphism had to be
represented by at least 2 individuals, (c) individuals with
more than 5% missing genotypes were excluded, and, finally,
(d) variants with missing data in any individual were removed.

The relationship between hypertension and age, gender,
and current tobacco smoking was first investigated by
χ² tests. Covariates significantly associated at the 5%
significance level entered the intercept-only model to build
the baseline model. Subsequently, standard logistic regression
(iteratively reweighted least squares) was used to identify 
possible hypertension-associated single-nucleotide poly- 
morphisms (SNPs) with minimal deviance, taking into 
account associated covariates. The deviance is defined as 
minus twice the logarithm of the likelihood. Genotypes 
were coded according to an additive penetrance model; 
that is, 0, 1, and 2. Departing observations (outliers) 
according to standard logistic regression were identified 
based on the Cook's distance in the baseline model. The
Cook's distance for observation i is defined as

$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{q \, \mathrm{MSE}}$$

where $\hat{y}_j$ denotes the full regression model prediction
for observation j, $\hat{y}_{j(i)}$ represents the regression model
prediction for observation j estimated omitting observation i,
and MSE indicates the mean square error of the
regression model with q explanatory variables.
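
The definition of the Cook's distance can be sketched in Python by explicit leave-one-out refitting. This is only an illustration for a simple linear model with one covariate, not the logistic baseline model used in this article; the toy data are hypothetical, and taking q = 2 (intercept plus slope) as the divisor is an assumption of the sketch.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def cooks_distances(x, y, q=2):
    """Cook's distance D_i for every observation, by refitting without it.

    q is the divisor in the definition; q = 2 is an assumption here.
    """
    n = len(x)
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]
    # Mean square error of the full model with n - q residual degrees of freedom.
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - q)
    distances = []
    for i in range(n):
        # Refit the model omitting observation i.
        a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        # Squared shift of all n predictions caused by omitting observation i.
        shift = sum((fitted[j] - (a_i + b_i * x[j])) ** 2 for j in range(n))
        distances.append(shift / (q * mse))
    return distances


# Hypothetical toy data: the last point departs from the linear trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 12.0]
d = cooks_distances(x, y)
most_influential = max(range(len(d)), key=d.__getitem__)
print("Most influential observation (0-based index):", most_influential)
```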

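To investigate the possible benefit of robust parameter
estimates in logistic regression, model coefficients were
also estimated by solving

$$\sum_{i=1}^{n} \psi(y_i, \mu_i) = \sum_{i=1}^{n} \left[ \nu(y_i, \mu_i)\, w(x_i)\, \mu_i' - a(\beta) \right] = 0$$

where $\nu(y_i, \mu_i) = \psi_c(r_i) / V^{1/2}(\mu_i)$, with the Pearson residuals
$r_i = (y_i - \mu_i) / V^{1/2}(\mu_i)$ and the Huber function

$$\psi_c(r_i) = \begin{cases} r_i & \text{for } |r_i| \le c \\ c\,\mathrm{sign}(r_i) & \text{for } |r_i| > c, \end{cases}$$

the correction term $a(\beta) = \frac{1}{n} \sum_{i=1}^{n} E[\nu(y_i, \mu_i)]\, w(x_i)\, \mu_i'$,
and the weights $w(x_i) = \sqrt{1 - h_{ii}}$, with $h_{ii}$ the i-th diagonal element of the
hat matrix $H = X (X^T X)^{-1} X^T$. Here $V(\mu_i)$ denotes the variance function
and $\mu_i' = \partial \mu_i / \partial \beta$.
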
This estimator is based on a quasi-likelihood and is
asymptotically normally distributed and Fisher consistent [13]. The
objective of the Huber function is to downweight the
influence of outliers and to assign inliers the usual weight.
Variable selection under robust logistic regression relied 
on the minimal quasideviance as described by Cantoni 
and Ronchetti, which is a robust test statistic for model 
selection [13]. The quasideviance between 2 nested models 
is defined as 



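$$\Lambda_{QM} = 2 \left[ \sum_{i=1}^{n} Q_M(y_i, \hat{\mu}_i) - \sum_{i=1}^{n} Q_M(y_i, \dot{\mu}_i) \right]$$

where

$$Q_M(y_i, \mu_i) = \int_{\tilde{s}}^{\mu_i} \nu(y_i, t)\, w(x_i)\, dt - \frac{1}{n} \sum_{j=1}^{n} \int_{\tilde{t}}^{\mu_j} E\left[\nu(y_j, t)\, w(x_j)\right] dt$$

with $\tilde{s}$ such that $\nu(y_i, \tilde{s}) = 0$ and $\tilde{t}$ such that
$E[\nu(y_j, \tilde{t})] = 0$. The estimated linear predictor $\hat{\mu}$ is
associated with the estimate $\hat{\beta}$ of $\beta$, and $\dot{\mu}$ is associated with
$\dot{\beta}$, which is the estimate of $(\beta_{(1)}, 0)$ under the nested model.
Linkage disequilibrium was not accounted for during variant selection
under either standard or robust logistic regression.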
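
The downweighting performed by the Huber function can be sketched in a few lines of Python for a binary outcome. The tuning constant c = 1.345 is a commonly used default and an assumption here, since the value of c is not stated in this article; the function names are illustrative.

```python
import math


def pearson_residual(y, mu):
    """Pearson residual of a binary outcome y with fitted probability mu."""
    return (y - mu) / math.sqrt(mu * (1.0 - mu))


def huber_psi(r, c=1.345):
    """Huber function: identity for |r| <= c, clipped to c * sign(r) beyond."""
    if abs(r) <= c:
        return r
    return c if r > 0 else -c


def huber_weight(r, c=1.345):
    """Implied weight psi_c(r) / r: 1 for inliers, c / |r| for outliers."""
    return 1.0 if r == 0 else huber_psi(r, c) / r


# Hypothetical example: a normotensive individual (y = 0) whose fitted
# probability of hypertension is 0.9 has a large negative residual.
r = pearson_residual(0, 0.9)
print("Pearson residual:", r)
print("Huber weight:", huber_weight(r))
```

An observation with a residual inside the interval [-c, c] keeps its full weight, whereas the contribution of the departing observation above is reduced to less than one half.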

Our comparison of the performance of standard and
robust logistic regression was based on different statistics.
First, standard and robust estimates of age effects
were used to exemplify the potential influence of departing
observations. Because the 2 methods handle outliers
differently, different age-genotype models were expected
to be selected under standard and robust logistic regression.
The areas under the receiver operating characteristic
curves (AUCs) were therefore compared to investigate the
discriminative performance of the selected models.
Comparisons were conducted for the complete data set and
after exclusion of potential outliers.
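
As a reminder of what the AUC measures, the following minimal Python sketch computes it as the proportion of case-control pairs in which the case receives the higher predicted probability, counting ties as one half. The predicted probabilities are hypothetical toy values and the function name is illustrative.

```python
def auc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control."""
    wins = 0.0
    for s_case in case_scores:
        for s_ctrl in control_scores:
            if s_case > s_ctrl:
                wins += 1.0
            elif s_case == s_ctrl:
                wins += 0.5  # ties count one half
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical predicted probabilities of hypertension.
cases = [0.9, 0.7, 0.4]
controls = [0.6, 0.3, 0.2, 0.4]
print("AUC:", auc(cases, controls))
```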

In addition, concordance, sensitivity, specificity, clini- 
cal net benefit, and AUCs were estimated for age- 
genotype models using a leave-one-out cross-validation 
approach [14]. Concordance was defined as the proportion
of correctly estimated hypertension statuses using
several cutoff values for the predicted affection probability.
The clinical net benefit (NB) was defined by

$$\mathrm{NB}(c) = \frac{\text{True-positive counts}}{\text{Sample size}} - \frac{\text{False-positive counts}}{\text{Sample size}} \cdot \frac{c}{1 - c}$$

$$= \text{Sensitivity} \cdot (\%\,\text{Hypertensive}) - (1 - \text{Specificity}) \cdot (\%\,\text{Normotensive}) \cdot \frac{c}{1 - c}$$

where c is the chosen threshold for allocating an individual
to the cases based on the logistic regression probability
estimate. Note that the net benefit depends on
the hypertension prevalence in the study population.
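
The 2 forms of the net benefit can be checked with a short Python sketch. The sensitivities and specificities are the rounded values reported in Table 4, and the prevalence is 43/130; the true-positive and false-positive counts (35 and 24) are back-calculated from those rounded rates and are therefore an assumption. Function names are illustrative.

```python
def net_benefit_from_counts(true_pos, false_pos, n, c):
    """Net benefit from true-positive and false-positive counts at threshold c."""
    return true_pos / n - (false_pos / n) * (c / (1.0 - c))


def net_benefit_from_rates(sensitivity, specificity, prevalence, c):
    """Equivalent form based on sensitivity, specificity, and prevalence."""
    return (sensitivity * prevalence
            - (1.0 - specificity) * (1.0 - prevalence) * c / (1.0 - c))


prevalence = 43 / 130  # 43 cases among 130 individuals

# Standard logistic regression at cutoff 0.3 (rounded rates from Table 4).
nb_standard = net_benefit_from_rates(0.81, 0.72, prevalence, 0.3)
# Robust logistic regression at cutoff 0.5 (rounded rates from Table 4).
nb_robust = net_benefit_from_rates(0.67, 0.90, prevalence, 0.5)
# Same quantity from counts; 35 and 24 are back-calculated assumptions.
nb_counts = net_benefit_from_counts(35, 24, 130, 0.3)

print(round(nb_standard, 2), round(nb_robust, 2), round(nb_counts, 2))
```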
The standard and robust logistic regression models were 
also compared based on the integrated discrimination 
index (IDI) estimated by cross-validation 



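$$\mathrm{IDI} = \left( \frac{1}{n_{\mathrm{cases}}} \sum_{i=1}^{n_{\mathrm{cases}}} P_{\mathrm{rob},i} - \frac{1}{n_{\mathrm{contr}}} \sum_{j=1}^{n_{\mathrm{contr}}} P_{\mathrm{rob},j} \right) - \left( \frac{1}{n_{\mathrm{cases}}} \sum_{i=1}^{n_{\mathrm{cases}}} P_{\mathrm{stand},i} - \frac{1}{n_{\mathrm{contr}}} \sum_{j=1}^{n_{\mathrm{contr}}} P_{\mathrm{stand},j} \right)$$

where $P_{\mathrm{rob},i}$, $P_{\mathrm{rob},j}$, $P_{\mathrm{stand},i}$, and $P_{\mathrm{stand},j}$ denote the
probability estimates from the robust and standard
logistic regression models for cases and controls [15].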
This index represents the difference in the discrimina- 
tion slopes of the 2 compared models. A positive IDI 
indicates that the robust model discriminates better 
between hypertensive and normotensive individuals than 
the standard model. Statistical analyses were carried out 
using the statistical language R, version 2.15.1 [16]. 
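
The IDI can be sketched in Python as a difference of 2 discrimination slopes. The predicted probabilities are hypothetical toy values, and the function names are illustrative.

```python
def discrimination_slope(p_cases, p_controls):
    """Mean predicted probability in cases minus that in controls."""
    return sum(p_cases) / len(p_cases) - sum(p_controls) / len(p_controls)


def idi(rob_cases, rob_controls, stand_cases, stand_controls):
    """Discrimination slope of the robust model minus that of the standard model."""
    return (discrimination_slope(rob_cases, rob_controls)
            - discrimination_slope(stand_cases, stand_controls))


# Hypothetical predicted probabilities for 2 cases and 2 controls.
value = idi(rob_cases=[0.8, 0.6], rob_controls=[0.2, 0.4],
            stand_cases=[0.7, 0.5], stand_controls=[0.3, 0.3])
print("IDI:", value)
```

In this toy example the robust model separates cases and controls slightly better than the standard model, so the IDI is positive.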

Results 

χ² tests revealed no influence of gender (p = 0.95) and
tobacco smoking (p = 1.00) on hypertension risk.
Hence, only age was included in the logistic regression 
models as covariate. Filter criteria resulted in 130 indivi- 
duals (43 cases and 87 controls) with complete genotype 
and phenotype information. The age of the individuals 
ranged between 20 and 95 years with a median age of 
52 years. The total number of measured SNPs on chro- 
mosome 3 in the investigated GAW18 data set was 
35,045. 

A plot of Cook's distances under the age-only stan- 
dard logistic regression model revealed several observa- 
tions (Figure 1) that departed from the majority of the 
sample. Considering a threshold of 0.05 for the Cook's 
distance, 4 observations could be defined as outliers. 
Information on disease status and age of deviating indi- 
viduals is shown in Table 1. Individuals 62, 58, and 24 
were older than 80 years and normotensive. Individual 
number 60 was affected by the condition early in life, at 
38 years of age. Table 1 shows the influence of the 4 
identified outliers on standard and robust parameter 
estimates of age effects. For example, the exclusion of 
individual 62 resulted in an 11.2% increase of the excess 
risk of hypertension per year according to standard 
logistic regression, compared to a 7.8% increase for 
robust logistic regression. Table 2 shows the odds of 
hypertension by age interval. 

Figure 1 Cook's distances from the age-only standard logistic
regression model. The 4 most prominent outliers are indicated by
their observation number.



Standard logistic regression identified SNP rs3934103 
located in the ULK4 gene as the variant that most 
improved the model fit. Robust logistic regression identified
SNP rs11918360 in RP11-408H1.3 as the variant
with the strongest association signal. Under both standard
and robust regression, model selection clearly
favored the 2 identified SNPs, as represented in Figure 2.
The pairwise r² between SNP rs3934103 and SNP
rs11918360 was 0.003.
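
A pairwise measure of this kind can be sketched in Python as the squared Pearson correlation between 2 genotype vectors coded 0, 1, and 2. How the pairwise value was computed is not stated in this article, so the genotype-based correlation is an assumption of the sketch, and the genotypes are hypothetical toy values.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two additively coded genotype vectors."""
    n = len(g1)
    m1 = sum(g1) / n
    m2 = sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    var1 = sum((a - m1) ** 2 for a in g1)
    var2 = sum((b - m2) ** 2 for b in g2)
    return cov * cov / (var1 * var2)


# Hypothetical genotypes of 6 individuals at 2 SNPs.
snp_a = [0, 1, 2, 0, 1, 2]
snp_b = [0, 0, 0, 1, 1, 1]
print("r2 between different SNPs:", genotype_r2(snp_a, snp_b))
print("r2 of a SNP with itself:  ", genotype_r2(snp_a, snp_a))
```

A value close to 0, as reported for the 2 selected SNPs, indicates that the variants carry essentially independent information.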

Table 3 shows the influence of the 4 outliers on the 
AUCs from the standard and robust logistic regression 
models. Robust and standard AUCs for the age-only 
models were identical. For the age-genotype models, the 
AUCs were slightly smaller and also slightly less outlier- 
dependent for robust logistic regression than for stan- 
dard logistic regression. 

Table 4 summarizes the results from the leave-one-out 
cross-validation. The concordance was better for the 
robust logistic regression model at every cutoff probabil- 
ity. Both models allocated best at probability 0.5 and 
almost identically at probability 0.3 (the investigated 
population included 43 cases and 87 controls; that is 
33% hypertension prevalence). At a probability of 0.3, 
sensitivities were identical and the specificity was slightly 
higher under robust regression. Standard and robust 
estimates showed similar discriminative performances 
supported by an IDI of -0.07 at every cutoff probability. 
AUCs were also almost identical. The clinical net benefit 
was slightly larger for the robust logistic regression 
model in the probability range between 0.2 and 0.6. 

Table 1 Estimated odds ratios per year of age

                               Standard logistic regression     Robust logistic regression
Excluded
individuals  HTN  Age (years)  OR-Age (95% CI)        % Change  OR-Age (95% CI)        % Change
None                           1.085 (1.050, 1.121)   ref.      1.084 (1.048, 1.122)   ref.
62           0    90.23        1.095 (1.057, 1.133)   +11.2%    1.091 (1.052, 1.131)   +7.8%
58           0    87.66        1.094 (1.056, 1.132)   +10.0%    1.091 (1.052, 1.131)   +7.9%
60           1    38.44        1.091 (1.054, 1.128)   +6.5%     1.089 (1.051, 1.128)   +5.1%
24           0    80.27        1.091 (1.054, 1.128)   +6.6%     1.091 (1.052, 1.131)   +7.6%

Odds ratios (ORs) were estimated based on standard and robust logistic regression
models for the complete set of individuals and after exclusion of the 4 most
remarkable outliers.
HTN: Hypertension.



Table 2 Overall odds of hypertension per age interval

Age interval (years)  Cases:controls  Odds
<39.0                 1:22            0.05
[39.0, 46.0)          2:20            0.10
[46.0, 56.2)          9:23            0.39
>56.2                 31:22           1.41

Age intervals were defined by the age quartiles in controls.
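
The odds in Table 2 follow directly from the case and control counts per age interval, as the following minimal Python sketch shows. The counts are those given in the table; the variable names are illustrative.

```python
# (cases, controls) per age interval, as reported in Table 2.
age_groups = {
    "<39.0": (1, 22),
    "[39.0, 46.0)": (2, 20),
    "[46.0, 56.2)": (9, 23),
    ">56.2": (31, 22),
}

# Odds of hypertension = number of cases divided by number of controls.
odds = {interval: cases / controls
        for interval, (cases, controls) in age_groups.items()}

for interval, value in odds.items():
    print(interval, round(value, 2))

total_cases = sum(cases for cases, _ in age_groups.values())
total_controls = sum(controls for _, controls in age_groups.values())
```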




Figure 2 Quantile-quantile plots from the age-genotype standard and robust logistic regression models. The 2 selected SNPs are
indicated by their reference SNP ID number. (Axis label recovered from the plot: -log10(p value) under standard logistic regression.)



Discussion 

The present results confirmed that single individuals (1/130
= 0.8% of the observations) with a departing risk of
hypertension may substantially affect the overall risk
estimates in the baseline model, causing up to an 11.2%
change in the estimated excess risk of hypertension per
year according to standard logistic regression in the
present exercise.

The identification of outliers is relatively straightforward 
using routine diagnostic plots, but outlier management is 

Table 3 Area under the receiver operating characteristic curve (AUC)

              Standard logistic regression      Robust logistic regression
Excluded      AUC-Age        AUC-Age + SNP      AUC-Age        AUC-Age + SNP
individuals   (% Change)     (% Change)         (% Change)     (% Change)
None          0.811 (ref.)   0.852 (ref.)       0.811 (ref.)   0.843 (ref.)
62            0.820 (+1.1%)  0.861 (+1.1%)      0.820 (+1.1%)  0.852 (+1.0%)
58            0.820 (+1.1%)  0.861 (+1.1%)      0.820 (+1.1%)  0.853 (+1.2%)
60            0.825 (+1.7%)  0.859 (+0.9%)      0.825 (+1.7%)  0.851 (+0.9%)
24            0.819 (+1.0%)  0.859 (+0.9%)      0.819 (+1.0%)  0.844 (+0.0%)

AUCs were calculated for the complete set of individuals and after exclusion of the 4 most remarkable outliers. The relative contributions of the variables age
and SNP (rs3934103 and rs11918360, respectively) are also shown.



Table 4 Concordance, sensitivity, specificity, clinical net benefit, and overall AUCs

             Standard logistic regression                        Robust logistic regression
Probability  Concordance                            Clinical     Concordance                            Clinical
cutoff       N (%)        Sensitivity  Specificity  net benefit  N (%)        Sensitivity  Specificity  net benefit
0.0          43 (33.1)    1.00         0.00         0.33         43 (33.1)    1.00         0.00         0.33
0.1          79 (60.8)    0.95         0.44         0.27         82 (63.1)    0.88         0.51         0.26
0.2          90 (69.2)    0.86         0.61         0.22         97 (74.6)    0.86         0.69         0.23
0.3          98 (75.4)    0.81         0.72         0.19         99 (76.2)    0.81         0.74         0.19
0.4          98 (75.4)    0.70         0.78         0.13         102 (78.5)   0.72         0.82         0.16
0.5          101 (77.7)   0.60         0.86         0.11         107 (82.3)   0.67         0.90         0.15
0.6          97 (74.6)    0.40         0.92         0.05         102 (78.5)   0.51         0.92         0.09
0.7          99 (76.2)    0.35         0.97         0.06         100 (76.9)   0.42         0.94         0.05
0.8          93 (71.5)    0.19         0.98         0.00         97 (74.6)    0.30         0.97         0.01
0.9          91 (70.0)    0.12         0.99         -0.03        93 (71.5)    0.19         0.98         -0.08
1.0          87 (66.9)    0.00         1.00                      87 (66.9)    0.00         1.00
AUC          0.835                                               0.830

These characteristics rely on the age-genotype models for standard and robust logistic regression estimated based on leave-one-out cross-validation.



extremely challenging. For example, the specification of 
thresholds for outlier definition is often arbitrary. Robust 
statistics aim to generate estimates that hold for the 
majority of the population using complete data. The 
unequal weighting of outliers by standard and robust 
regression resulted in prediction models that included dif- 
ferent genetic variants. 

Although robust estimates of age effects and AUCs for 
age-genotype models were less sensitive to outliers than 
standard estimates in the investigated sample, cross-vali- 
dation AUCs based on standard and robust logistic 
regression, as well as IDI, were almost identical. The 
other investigated performance characteristics (concor- 
dance, sensitivity, specificity, and clinical net benefit) 
were equal or better for robust logistic regression 
around the probability that reflects the case-control 
ratio. 

The standard logistic regression model selected 1 variant
in the ULK4 gene. It was previously shown that variants
in this gene are associated with hypertension
[4,17]. Among others, 4 variants (rs2272007, rs3774372,
rs1716975, rs1052501) mentioned in the 2 publications
were also genotyped in the GAW18 sample, and we
found them to be in linkage disequilibrium (r² values
0.83, 0.73, 0.83, and 0.83, respectively) with the
associated SNP rs3934103.

Conclusions 

Preliminary findings suggest some advantage of robust 
statistics in the context of genetic association studies. 
However, present results were limited to a given sample 
size, as well as to particular genetic effect sizes and pro- 
portions of outliers. Additional analyses based on both 
real data and more general simulated scenarios should 
be conducted to validate initial findings. 